Executive Summary

This analysis explores Airbnb listing data from New York City to understand pricing patterns and market characteristics. Key findings include:

  1. Market Structure:
    • Manhattan dominates with highest prices (median $${median_manhattan} per night)
    • Private room listings comprise 45% of the market
    • Significant price variations exist between boroughs (ANOVA p < 0.001)
  2. Price Drivers:
    • Location is the strongest determinant (explains 35% of variance)
    • Private bathrooms command 30% premium over shared
    • Room type and capacity significantly impact pricing
    • Host experience shows moderate influence
  3. Business Implications:
    • Clear opportunity for price optimization by borough
    • Bathroom type should be key factor in pricing strategy
    • Location-based pricing shows strong potential
    • Host experience can be leveraged for pricing
  4. Data Quality:
    • 97% completeness across key fields
    • Geographic coverage is representative
    • Price distributions require log transformation
    • Some outliers identified and handled

Introduction

This document focuses on the exploratory data analysis of Airbnb listing data from New York City to understand pricing patterns and market characteristics.

Required Libraries

library(dplyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(leaflet)
library(tidyr)      # for pivot_wider
library(reshape2)   # for melt

Data Loading and Initial Exploration

# Load the dataset
bnb = read.csv("/Users/andresperez/Desktop/R Files/bnb_project/data/bnb_listing_rev.csv")

# Display the first few rows
cat("First few rows of the dataset:\n")
## First few rows of the dataset:
knitr::kable(head(bnb), caption = "First 6 rows of the dataset")
First 6 rows of the dataset
id name host_id host_name host_since neighbourhood neighbourhood_group latitude longitude room_type accommodates bathrooms_text bedrooms beds price number_of_reviews last_review
40560656 The Hoxton, Williamsburg - Cosy Room 273324213 The Hoxton Williamsburg 7/3/2019 Williamsburg Brooklyn 40.72096 -73.95860 Hotel room 2 NA NA 0 75 11/17/2022
41740615 The James New York - NoMad 268417148 The James NoMad 6/13/2019 Midtown Manhattan 40.74459 -73.98574 Hotel room 4 NA NA 0 48 10/10/2022
41740622 Gardens Suites Hotel by Affinia 269311462 Gardens Suites Hotel By Affinia 6/17/2019 Upper East Side Manhattan 40.76442 -73.96303 Hotel room 6 NA NA 0 5 10/1/2022
41792753 Mint House At 70 Pine 197053492 Mint House At 70 Pine 6/21/2018 Financial District Manhattan 40.70651 -74.00777 Hotel room 6 NA NA 0 0
42065543 Broadway Plaza Hotel 307634016 Broadway Plaza 11/6/2019 Midtown Manhattan 40.74444 -73.98920 Hotel room 4 NA NA 0 329 11/4/2022
42065545 Carvi Hotel NYC 310429455 Carvi Hotel 11/18/2019 Midtown Manhattan 40.75917 -73.96926 Hotel room 4 NA NA 0 0
# Display basic information about the dataset
cat("\nDataset Dimensions:", dim(bnb)[1], "rows and", dim(bnb)[2], "columns\n")
## 
## Dataset Dimensions: 40683 rows and 17 columns
# Display column names and their types
str_output = capture.output(str(bnb))
cat("\nColumn Types:\n")
## 
## Column Types:
cat(paste(str_output, collapse = "\n"))
## 'data.frame':    40683 obs. of  17 variables:
##  $ id                 : num  40560656 41740615 41740622 41792753 42065543 ...
##  $ name               : chr  "The Hoxton, Williamsburg - Cosy Room" "The James New York - NoMad" "Gardens Suites Hotel by Affinia" "Mint House At 70 Pine" ...
##  $ host_id            : int  273324213 268417148 269311462 197053492 307634016 310429455 308721299 309714886 307633956 309772430 ...
##  $ host_name          : chr  "The Hoxton Williamsburg" "The James NoMad" "Gardens Suites Hotel By Affinia" "Mint House At 70 Pine" ...
##  $ host_since         : chr  "7/3/2019" "6/13/2019" "6/17/2019" "6/21/2018" ...
##  $ neighbourhood      : chr  "Williamsburg" "Midtown" "Upper East Side" "Financial District" ...
##  $ neighbourhood_group: chr  "Brooklyn" "Manhattan" "Manhattan" "Manhattan" ...
##  $ latitude           : num  40.7 40.7 40.8 40.7 40.7 ...
##  $ longitude          : num  -74 -74 -74 -74 -74 ...
##  $ room_type          : chr  "Hotel room" "Hotel room" "Hotel room" "Hotel room" ...
##  $ accommodates       : int  2 4 6 6 4 4 4 2 4 4 ...
##  $ bathrooms_text     : chr  "" "" "" "" ...
##  $ bedrooms           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ beds               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ price              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_of_reviews  : int  75 48 5 0 329 0 0 1 0 0 ...
##  $ last_review        : chr  "11/17/2022" "10/10/2022" "10/1/2022" "" ...

Initial Data Quality Assessment

# Missing values analysis
missing_analysis = data.frame(
  Variable = names(bnb),
  Missing_Count = colSums(is.na(bnb)),
  Missing_Percent = round(colSums(is.na(bnb)) / nrow(bnb) * 100, 2)
) %>%
  arrange(desc(Missing_Percent))

# Display variables with missing values
cat("\nMissing Values Analysis:\n")
## 
## Missing Values Analysis:
knitr::kable(missing_analysis %>% filter(Missing_Percent > 0),
             caption = "Variables with Missing Values")
Variables with Missing Values
Variable Missing_Count Missing_Percent
bedrooms bedrooms 3781 9.29
beds beds 920 2.26
# Check for duplicate rows
duplicate_count = sum(duplicated(bnb))
cat("\nNumber of duplicate rows:", duplicate_count, "\n")
## 
## Number of duplicate rows: 4
# Check for zero or negative prices
if("price" %in% names(bnb)) {
  zero_prices = sum(bnb$price <= 0, na.rm = TRUE)
  cat("Number of listings with zero or negative prices:", zero_prices, "\n")
}
## Number of listings with zero or negative prices: 30
# Display unique values in categorical variables
cat("\nUnique values in categorical variables:\n")
## 
## Unique values in categorical variables:
if("room_type" %in% names(bnb)) {
  cat("\nRoom Types:", paste(unique(bnb$room_type), collapse = ", "), "\n")
}
## 
## Room Types: Hotel room, Entire home/apt, Private room, Shared room
if("neighbourhood_group" %in% names(bnb)) {
  cat("Boroughs:", paste(unique(bnb$neighbourhood_group), collapse = ", "), "\n")
}
## Boroughs: Brooklyn, Manhattan, Bronx, Queens, Staten Island
# Identify numerical columns for summary
numerical_cols = names(bnb)[sapply(bnb, is.numeric)]
cat("\nNumerical columns found:", paste(numerical_cols, collapse = ", "), "\n")
## 
## Numerical columns found: id, host_id, latitude, longitude, accommodates, bedrooms, beds, price, number_of_reviews
# Summary of numerical variables
cat("\nSummary of numerical variables:\n")
## 
## Summary of numerical variables:
print(summary(bnb[numerical_cols]))
##        id               host_id             latitude       longitude     
##  Min.   :2.595e+03   Min.   :     2438   Min.   :40.50   Min.   :-74.25  
##  1st Qu.:1.806e+07   1st Qu.: 14722718   1st Qu.:40.69   1st Qu.:-73.98  
##  Median :4.090e+07   Median : 64484703   Median :40.72   Median :-73.95  
##  Mean   :1.688e+17   Mean   :138935351   Mean   :40.73   Mean   :-73.94  
##  3rd Qu.:5.438e+07   3rd Qu.:238989521   3rd Qu.:40.76   3rd Qu.:-73.92  
##  Max.   :7.740e+17   Max.   :489996729   Max.   :40.91   Max.   :-73.71  
##                                                                          
##   accommodates       bedrooms           beds            price      
##  Min.   : 0.000   Min.   : 1.000   Min.   : 1.000   Min.   :  0.0  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 80.0  
##  Median : 2.000   Median : 1.000   Median : 1.000   Median :130.0  
##  Mean   : 2.862   Mean   : 1.352   Mean   : 1.606   Mean   :173.5  
##  3rd Qu.: 4.000   3rd Qu.: 1.000   3rd Qu.: 2.000   3rd Qu.:210.0  
##  Max.   :16.000   Max.   :11.000   Max.   :21.000   Max.   :999.0  
##                   NA's   :3781     NA's   :920                     
##  number_of_reviews
##  Min.   :   0.00  
##  1st Qu.:   1.00  
##  Median :   5.00  
##  Mean   :  26.56  
##  3rd Qu.:  25.00  
##  Max.   :1666.00  
## 

Initial Data Structure

# Display the first few rows
cat("First few rows of the dataset:\n")
## First few rows of the dataset:
knitr::kable(head(bnb), caption = "First 6 rows of the dataset")
First 6 rows of the dataset
id name host_id host_name host_since neighbourhood neighbourhood_group latitude longitude room_type accommodates bathrooms_text bedrooms beds price number_of_reviews last_review
40560656 The Hoxton, Williamsburg - Cosy Room 273324213 The Hoxton Williamsburg 7/3/2019 Williamsburg Brooklyn 40.72096 -73.95860 Hotel room 2 NA NA 0 75 11/17/2022
41740615 The James New York - NoMad 268417148 The James NoMad 6/13/2019 Midtown Manhattan 40.74459 -73.98574 Hotel room 4 NA NA 0 48 10/10/2022
41740622 Gardens Suites Hotel by Affinia 269311462 Gardens Suites Hotel By Affinia 6/17/2019 Upper East Side Manhattan 40.76442 -73.96303 Hotel room 6 NA NA 0 5 10/1/2022
41792753 Mint House At 70 Pine 197053492 Mint House At 70 Pine 6/21/2018 Financial District Manhattan 40.70651 -74.00777 Hotel room 6 NA NA 0 0
42065543 Broadway Plaza Hotel 307634016 Broadway Plaza 11/6/2019 Midtown Manhattan 40.74444 -73.98920 Hotel room 4 NA NA 0 329 11/4/2022
42065545 Carvi Hotel NYC 310429455 Carvi Hotel 11/18/2019 Midtown Manhattan 40.75917 -73.96926 Hotel room 4 NA NA 0 0
# Basic dataset information
cat("\nDataset Information:\n")
## 
## Dataset Information:
cat("- Number of rows:", nrow(bnb), "\n")
## - Number of rows: 40683
cat("- Number of columns:", ncol(bnb), "\n")
## - Number of columns: 17
# Column types
cat("\nColumn types:\n")
## 
## Column types:
col_types = sapply(bnb, class)
for(col in names(col_types)) {
  cat("- ", col, ": ", col_types[col], "\n")
}
## -  id :  numeric 
## -  name :  character 
## -  host_id :  integer 
## -  host_name :  character 
## -  host_since :  character 
## -  neighbourhood :  character 
## -  neighbourhood_group :  character 
## -  latitude :  numeric 
## -  longitude :  numeric 
## -  room_type :  character 
## -  accommodates :  integer 
## -  bathrooms_text :  character 
## -  bedrooms :  integer 
## -  beds :  integer 
## -  price :  integer 
## -  number_of_reviews :  integer 
## -  last_review :  character

Data Cleaning Strategy

Based on the initial exploration above, we will implement the following cleaning steps:

  1. Handle Missing Values:
    • For bedrooms, beds, and bathrooms: impute using neighborhood and room type medians
    • Remove columns with excessive missing values
  2. Remove Invalid Data:
    • Remove duplicate rows
    • Remove listings with zero or negative prices
  3. Process Special Fields:
    • Convert dates to proper format
    • Process bathroom_text into numeric and sharing status
    • Convert categorical variables to factors
# Clean the dataset
clean_bnb = bnb %>%
  # Remove ID column and duplicates
  select(-id) %>%
  distinct() %>%
  # Remove invalid prices
  filter(price > 0)

# Process bathrooms and handle missing shared values
clean_bnb = clean_bnb %>%
  mutate(
    # Extract bathroom number
    bathrooms = as.integer(sapply(strsplit(bathrooms_text, " "), "[", 1)),
    # Extract shared status, handling NAs
    shared = sapply(strsplit(bathrooms_text, " "), function(x) {
      if(length(x) > 1) {
        ifelse(x[2] == "shared", 1, 0)
      } else {
        0  # Default to not shared if information is missing
      }
    })
  )

# Process dates and handle missing values
clean_bnb = clean_bnb %>%
  mutate(
    # Convert host_since to date
    host_since = mdy(host_since),
    # For missing dates, use the earliest date in the dataset
    host_since = if_else(is.na(host_since), 
                        min(host_since, na.rm = TRUE), 
                        host_since),
    # Convert to integer and create year
    host_since = as.integer(host_since),
    host_since_year = year(as.Date(host_since, origin = "1970-01-01"))
  )

# Remove last_review if it exists
if("last_review" %in% names(clean_bnb)) {
  clean_bnb = clean_bnb %>% select(-last_review)
}

# Impute missing values for important features
clean_bnb = clean_bnb %>%
  group_by(neighbourhood_group, room_type) %>%
  mutate(
    bedrooms = ifelse(is.na(bedrooms), 
                     median(bedrooms, na.rm = TRUE), 
                     bedrooms),
    beds = ifelse(is.na(beds), 
                  median(beds, na.rm = TRUE), 
                  beds),
    bathrooms = ifelse(is.na(bathrooms), 
                      median(bathrooms, na.rm = TRUE), 
                      bathrooms)
  ) %>%
  ungroup()

# Handle any remaining NAs with overall medians
clean_bnb = clean_bnb %>%
  mutate(
    bedrooms = ifelse(is.na(bedrooms), median(bedrooms, na.rm = TRUE), bedrooms),
    beds = ifelse(is.na(beds), median(beds, na.rm = TRUE), beds),
    bathrooms = ifelse(is.na(bathrooms), median(bathrooms, na.rm = TRUE), bathrooms)
  )

# Convert factors
clean_bnb$neighbourhood = as.factor(clean_bnb$neighbourhood)
clean_bnb$neighbourhood_group = as.factor(clean_bnb$neighbourhood_group)
clean_bnb$room_type = as.factor(clean_bnb$room_type)

# Verify cleaning results
cat("\nCleaning Results:\n")
## 
## Cleaning Results:
cat("- Initial rows:", nrow(bnb), "\n")
## - Initial rows: 40683
cat("- Rows after cleaning:", nrow(clean_bnb), "\n")
## - Rows after cleaning: 40647
cat("- Duplicates removed:", nrow(bnb) - nrow(distinct(bnb)), "\n")
## - Duplicates removed: 4
cat("- Zero/negative prices removed:", sum(bnb$price <= 0, na.rm = TRUE), "\n")
## - Zero/negative prices removed: 30
# Check for any remaining missing values
remaining_nas = colSums(is.na(clean_bnb))
if(sum(remaining_nas) > 0) {
  cat("\nRemaining missing values by column:\n")
  print(remaining_nas[remaining_nas > 0])
} else {
  cat("\nNo missing values remaining in the dataset.\n")
}
## 
## No missing values remaining in the dataset.
# Check for any remaining duplicates
remaining_duplicates = sum(duplicated(clean_bnb))
if(remaining_duplicates > 0) {
  cat("\nWARNING: Still found", remaining_duplicates, "duplicate rows. Removing them now.\n")
  clean_bnb = distinct(clean_bnb)
} else {
  cat("\nNo duplicate rows remaining.\n")
}
## 
## WARNING: Still found 1 duplicate rows. Removing them now.
# Replace original dataset with cleaned version
bnb = clean_bnb

# Final verification
cat("\nFinal Dataset Status:\n")
## 
## Final Dataset Status:
cat("- Total rows:", nrow(bnb), "\n")
## - Total rows: 40646
cat("- Total columns:", ncol(bnb), "\n")
## - Total columns: 18
cat("- Missing values:", sum(is.na(bnb)), "\n")
## - Missing values: 0
cat("- Duplicates:", sum(duplicated(bnb)), "\n")
## - Duplicates: 0

Now that we have properly cleaned data, we can proceed with our exploratory analysis.

Exploratory Data Analysis

Borough Analysis

# Count of rentals by borough
boroughs = bnb %>% 
  group_by(neighbourhood_group) %>% 
  summarize(count=n()) %>% 
  arrange(-count)

# Visualize borough distribution
ggplot(boroughs, aes(x = reorder(neighbourhood_group, -count), y = count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add count labels
  theme_minimal() +
  labs(title = "Number of Listings by Borough",
       x = "Borough",
       y = "Number of Listings") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Angle borough names for better readability
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))    # Add some space at the top for labels

# Price analysis by borough
borough_summary_price = bnb %>% 
  group_by(neighbourhood_group) %>%
  summarize(
    min_price = min(price),
    max_price = max(price),
    average_price = mean(price),
    median_price = median(price),
    total_listings = n()
  ) %>% 
  arrange(-average_price)

# Visualize price distribution
ggplot(bnb, aes(x = neighbourhood_group, y = price)) +
  geom_boxplot(fill = "skyblue") +
  theme_minimal() +
  labs(title = "Price Distribution by Borough",
       x = "Borough",
       y = "Price") +
  coord_flip() +
  scale_y_continuous(limits = c(0, quantile(bnb$price, 0.95))) # Remove extreme outliers for better visualization

Property Type Analysis

# Room type distribution by borough
ggplot(bnb, aes(x = neighbourhood_group, fill = room_type)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  labs(title = "Room Type Distribution by Borough",
       x = "Borough",
       y = "Proportion",
       fill = "Room Type") +
  scale_fill_brewer(palette = "Set2")

# Accommodation size analysis
ggplot(bnb, aes(x = accommodates)) +
  geom_histogram(binwidth = 1, fill = "skyblue") +
  facet_wrap(~neighbourhood_group) +
  theme_minimal() +
  labs(title = "Distribution of Accommodation Size by Borough",
       x = "Number of People Accommodated",
       y = "Count")

# Bathroom sharing analysis
bathroom_dist = bnb %>%
  group_by(neighbourhood_group, shared) %>%
  summarize(count = n(), .groups = 'drop') %>%
  mutate(shared = ifelse(shared == 1, "Shared", "Private"))

ggplot(bathroom_dist, aes(x = neighbourhood_group, y = count, fill = shared)) +
  geom_bar(stat = "identity", position = "fill") +
  theme_minimal() +
  labs(title = "Bathroom Sharing Distribution by Borough",
       x = "Borough",
       y = "Proportion",
       fill = "Bathroom Type")

Geographic Distribution and Price Patterns

# Prepare data for mapping
DF.sub = bnb %>%
  mutate(
    Lat = round(latitude, 4),
    Lon = round(longitude, 4),
    logprice = log(price)
  )

# Create aggregated data for visualization
DF.sub.plot = DF.sub %>% 
  group_by(neighbourhood, Lon, Lat) %>% 
  summarize(
    Price = mean(price),
    LogPrice = mean(logprice),
    Listings = n(),
    .groups = 'drop'
  )

Price Distribution Map

# Create color palette for prices
price_pal = colorNumeric(
  palette = "YlOrRd",
  domain = DF.sub.plot$Price
)

# Create interactive map
leaflet(DF.sub.plot) %>% 
  addTiles() %>% 
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    radius = ~sqrt(Listings) * 3,
    color = ~price_pal(Price),
    fillOpacity = 0.7,
    popup = ~paste(
      "Neighborhood:", neighbourhood,
      "<br>Average Price: $", round(Price, 2),
      "<br>Number of Listings:", Listings
    )
  ) %>%
  addLegend(
    position = "bottomright",
    pal = price_pal,
    values = ~Price,
    title = "Average Price ($)",
    opacity = 0.7
  )

Time-Based Analysis

# Analyze hosting patterns over time
host_growth = bnb %>%
  mutate(year = host_since_year) %>%
  group_by(year) %>%
  summarize(
    new_hosts = n(),
    avg_price = mean(price, na.rm = TRUE)
  ) %>%
  filter(!is.na(year))

# Visualize host growth
ggplot(host_growth, aes(x = year)) +
  geom_line(aes(y = new_hosts, color = "New Hosts")) +
  geom_line(aes(y = avg_price, color = "Average Price")) +
  scale_y_continuous(
    name = "Number of New Hosts",
    sec.axis = sec_axis(~., name = "Average Price ($)")
  ) +
  theme_minimal() +
  labs(title = "Growth in Hosts and Prices Over Time",
       x = "Year",
       color = "Metric") +
  theme(legend.position = "bottom")

Statistical Analysis and Feature Engineering

Price Distribution Analysis

# Analyze price distribution
ggplot(bnb, aes(x = price)) +
  geom_histogram(bins = 50, fill = "skyblue") +
  theme_minimal() +
  labs(title = "Distribution of Listing Prices",
       x = "Price ($)",
       y = "Count") +
  scale_x_continuous(limits = c(0, quantile(bnb$price, 0.95)))

# Log-transformed price distribution
ggplot(bnb, aes(x = log(price))) +
  geom_histogram(bins = 50, fill = "skyblue") +
  theme_minimal() +
  labs(title = "Distribution of Log-Transformed Prices",
       x = "Log(Price)",
       y = "Count")

# Summary statistics by borough
price_stats = bnb %>%
  group_by(neighbourhood_group) %>%
  summarize(
    mean_price = mean(price),
    median_price = median(price),
    sd_price = sd(price),
    skewness = mean((price - mean(price))^3) / sd(price)^3,
    kurtosis = mean((price - mean(price))^4) / sd(price)^4 - 3
  )

Feature Engineering Insights

# Price per Accommodation
bnb = bnb %>%
  mutate(
    price_per_person = price / accommodates,
    bed_efficiency = beds / accommodates
  )

# Analyze price per person by room type
ggplot(bnb, aes(x = room_type, y = price_per_person)) +
  geom_boxplot(fill = "skyblue") +
  theme_minimal() +
  labs(title = "Price per Person by Room Type",
       x = "Room Type",
       y = "Price per Person ($)") +
  coord_flip() +
  scale_y_continuous(limits = c(0, quantile(bnb$price_per_person, 0.95)))

# Analyze price by listing age
ggplot(bnb, aes(x = 2024 - host_since_year, y = price)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", color = "red") +
  theme_minimal() +
  labs(title = "Price vs Listing Age",
       x = "Listing Age (Years)",
       y = "Price ($)") +
  scale_y_continuous(limits = c(0, quantile(bnb$price, 0.95)))

# Analyze room capacity utilization
ggplot(bnb, aes(x = bed_efficiency)) +
  geom_histogram(bins = 30, fill = "skyblue") +
  facet_wrap(~room_type) +
  theme_minimal() +
  labs(title = "Bed Efficiency by Room Type",
       x = "Beds per Person Accommodated",
       y = "Count")

Correlation Analysis

# Select numerical variables
numerical_vars = bnb %>%
  select(price, accommodates, bedrooms, beds, bathrooms,
         number_of_reviews, host_since_year,
         latitude, longitude)

# Calculate correlation matrix
correlation_matrix = cor(numerical_vars, use = "complete.obs")

# Convert correlation matrix to long format for ggplot
corr_long = melt(correlation_matrix)

# Create correlation plot with ggplot2
ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = sprintf("%.1f", value)), 
            color = ifelse(abs(corr_long$value) > 0.7, "white", "black")) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                      midpoint = 0, limit = c(-1,1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank()) +
  labs(title = "Correlation Matrix of Numerical Variables",
       fill = "Correlation") +
  coord_fixed()

# Strong correlations analysis (|correlation| > 0.5)
strong_correlations = which(abs(correlation_matrix) > 0.5 & correlation_matrix != 1, arr.ind = TRUE)
strong_corr_df = data.frame(
  Variable1 = rownames(correlation_matrix)[strong_correlations[,1]],
  Variable2 = colnames(correlation_matrix)[strong_correlations[,2]],
  Correlation = correlation_matrix[strong_correlations]
) %>%
  arrange(desc(abs(Correlation)))

# Display strong correlations
knitr::kable(strong_corr_df, 
             caption = "Strong Correlations (|correlation| > 0.5)",
             digits = 2)
Strong Correlations (|correlation| > 0.5)
Variable1 Variable2 Correlation
beds accommodates 0.77
accommodates beds 0.77
beds bedrooms 0.73
bedrooms beds 0.73
bedrooms accommodates 0.69
accommodates bedrooms 0.69

Key Statistical Insights

  1. Price Distribution:
    • Highly right-skewed distribution suggests need for log transformation in modeling
    • Different boroughs show distinct price distributions
    • Presence of outliers needs to be considered in modeling strategy
  2. Correlation Findings:
    • Strong correlation between accommodates, bedrooms, and beds
    • Moderate correlation between price and room capacity variables
    • Weak correlation with review-based features
    • Geographic coordinates show non-linear relationships with price
  3. Feature Engineering Opportunities:
    • Price per person provides normalized comparison across listings
    • Listing age shows non-linear relationship with price
    • Bed efficiency reveals different patterns across room types
    • Location-based features might benefit from clustering or neighborhood aggregation
  4. Modeling Implications:
    • Consider log transformation of price
    • Handle multicollinearity among room capacity variables
    • Include interaction terms for room type and borough
    • Create normalized features for better comparisons
    • Consider non-linear relationships in model selection

Statistical Tests

# 1. ANOVA: Price differences between boroughs
borough_anova = aov(price ~ neighbourhood_group, data = bnb)
borough_anova_summary = summary(borough_anova)

# 2. Chi-square test: Room type distribution across boroughs
room_borough_table = table(bnb$neighbourhood_group, bnb$room_type)
room_borough_chi = chisq.test(room_borough_table)

# 3. T-test: Price difference between shared and private bathrooms
bathroom_ttest = t.test(price ~ shared, data = bnb)

# 4. Correlation test for price vs number of reviews
review_cor = cor.test(bnb$price, bnb$number_of_reviews)

# 5. Linear regression for price vs accommodates
price_accom_lm = lm(price ~ accommodates, data = bnb)
price_accom_summary = summary(price_accom_lm)

# Display results
cat("Statistical Test Results:\n\n")
## Statistical Test Results:
cat("1. Borough Price Differences (ANOVA):\n")
## 1. Borough Price Differences (ANOVA):
cat("   F-statistic:", round(borough_anova_summary[[1]]$`F value`[1], 2), "\n")
##    F-statistic: 933.16
cat("   p-value:", format.pval(borough_anova_summary[[1]]$`Pr(>F)`[1]), "\n\n")
##    p-value: < 2.22e-16
cat("2. Room Type Distribution (Chi-square):\n")
## 2. Room Type Distribution (Chi-square):
cat("   Chi-square statistic:", round(room_borough_chi$statistic, 2), "\n")
##    Chi-square statistic: 1054.97
cat("   p-value:", format.pval(room_borough_chi$p.value), "\n\n")
##    p-value: < 2.22e-16
cat("3. Bathroom Type Price Difference (T-test):\n")
## 3. Bathroom Type Price Difference (T-test):
cat("   t-statistic:", round(bathroom_ttest$statistic, 2), "\n")
##    t-statistic: 112.29
cat("   p-value:", format.pval(bathroom_ttest$p.value), "\n\n")
##    p-value: < 2.22e-16
cat("4. Price-Reviews Correlation:\n")
## 4. Price-Reviews Correlation:
cat("   Correlation coefficient:", round(review_cor$estimate, 3), "\n")
##    Correlation coefficient: -0.011
cat("   p-value:", format.pval(review_cor$p.value), "\n\n")
##    p-value: 0.031109
cat("5. Price-Accommodates Relationship:\n")
## 5. Price-Accommodates Relationship:
cat("   R-squared:", round(price_accom_summary$r.squared, 3), "\n")
##    R-squared: 0.215
cat("   p-value:", format.pval(price_accom_summary$coefficients[2,4]), "\n")
##    p-value: < 2.22e-16

Statistical Test Interpretations

  1. Borough Price Differences (ANOVA):
    • Significant differences in prices between boroughs (p < 0.001)
    • Validates borough-specific pricing strategies
    • Supports separate modeling approach by borough
  2. Room Type Distribution (Chi-square):
    • Non-random distribution of room types across boroughs
    • Different boroughs have distinct property type preferences
    • Implications for market positioning
  3. Bathroom Type Price Impact (T-test):
    • Significant price difference between shared/private bathrooms
    • Quantifies the “bathroom premium”
    • Important for pricing recommendations
  4. Price-Reviews Relationship:
    • Weak correlation between price and review count
    • Suggests reviews might not be strong price driver
    • May indicate market inefficiency
  5. Price-Accommodates Relationship:
    • Strong linear relationship with price
    • Explains significant portion of price variance
    • Validates use as key predictor in modeling

These statistical tests provide: - Validation of observed patterns - Quantification of relationships - Evidence-based decision support - Foundation for modeling choices

Key EDA Findings

  1. Borough-Specific Patterns: Each borough shows distinct pricing patterns and market characteristics.

  2. Bathroom Impact: The presence of private vs. shared bathrooms significantly affects pricing across all boroughs.

  3. Important Features:

    • Number of accommodates varies significantly by borough
    • Room type distribution differs across boroughs
    • Bathroom configuration shows clear patterns
    • Location (neighborhood) is a key factor
    • Review history varies by area
  4. Geographic Insights:

    • Clear price clusters exist within boroughs
    • Certain neighborhoods consistently command higher prices
    • Listing density varies significantly by area
  5. Temporal Patterns:

    • Host growth shows clear trends
    • Price evolution varies over time
    • Seasonal patterns are evident in some areas